
release-22.1: util/tracing: fix crash in StreamClientInterceptor #80911

Conversation

@andreimatei (Contributor) commented May 3, 2022

This is a backport of parts of #80878. Compared to that patch, this one leaves the finalizer in place, to not rock the boat before the 22.1 release.

Before this patch, our client-side tracing interceptor for streaming rpc
calls was exposed to gRPC bugs being currently fixed in
github.com/grpc/grpc-go/pull/5323. This had to do with calls to
clientStream.Context() panicking with an NPE under certain races with
RPCs failing. We've recently gotten two crashes seemingly because of
this. It's unclear why this hasn't shown up before, as nothing seems new
(either on our side or on the grpc side). In 22.2 we do use more
streaming RPCs than before (for example for span configs), so maybe that
does it.

This patch fixes the problem by eliminating the
problematic call into ClientStream.Context(). The background is that
our interceptor needs to watch for ctx cancelation and consider the RPC
done at that point. But there was no reason for that call; we can more
simply use the RPC caller's ctx to figure out whether the
caller cancels the RPC. In fact, calling ClientStream.Context() is bad
for other reasons, beyond exposing us to the bug:

  1. ClientStream.Context() pins the RPC attempt to a lower-level
    connection, and inhibits gRPC's ability to sometimes transparently
    retry failed calls. In fact, there's a comment on ClientStream.Context()
    that tells you not to call it before using the stream through other
    methods like Recv(), which imply that the RPC is already "pinned" and
    transparent retries are no longer possible anyway. We were breaking
    this.
  2. One of the grpc-go maintainers suggested that, due to the bugs
    referenced above, this call could actually sometimes give us "the
    wrong context", although how wrong exactly is not clear to me (i.e.
    could we have gotten a ctx that doesn't inherit from the caller's ctx,
    or a ctx that's canceled independently of the caller?)

Release note: A rare crash indicating a nil-pointer dereference in
google.golang.org/grpc/internal/transport.(*Stream).Context(...)
was fixed.

Release justification: release blocker

@andreimatei andreimatei requested review from dhartunian and a team May 3, 2022 15:06
blathers-crl bot commented May 3, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity (Member) commented: This change is Reviewable

@dhartunian (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @dhartunian)


pkg/util/tracing/grpc_interceptor_test.go line 140 at r1 (raw file):

	c := grpcutils.NewGRPCTestClient(conn)
	unusedAny, err := types.MarshalAny(&types.Empty{})
	require.NoError(t, err)

I noticed that the "make tests independent" part of the diff is not here. Is that related to the finalizer since it's checking for leaked spans?


pkg/util/tracing/grpc_interceptor_test.go line 230 at r1 (raw file):

	// the span. Nothing else closes that span in this test. See
	// newTracingClientStream().
	runtime.GC()

If the finalizer is still in place, does the GC need to be triggered?

@dhartunian dhartunian self-requested a review May 3, 2022 16:19
@andreimatei (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)


pkg/util/tracing/grpc_interceptor_test.go line 140 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I noticed that the "make tests independent" part of the diff is not here. Is that related to the finalizer since it's checking for leaked spans?

I simply haven't backported that commit, to keep the backport more focused. That test change is independent of the rest. It makes the test fail better in cases where we have span leaks.


pkg/util/tracing/grpc_interceptor_test.go line 230 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

If the finalizer is still in place, does the GC need to be triggered?

No; the finalizer stays in place in case we have leaks of the gRPC streams and their spans (which I don't think we have). But this test no longer needs the finalizer, because the test no longer leaks streams. Before this patch, the test was violating the gRPC contract in the UnaryStream and StreamStream cases because it was not fully consuming the stream - we were just calling Recv() once and then abandoning the stream, but you're supposed to read until you get an error, or cancel the call's ctx. This patch brings the test in line with the gRPC contract.

@andreimatei andreimatei merged commit 0a7883a into cockroachdb:release-22.1 May 4, 2022
@andreimatei andreimatei deleted the backport.tracing.fix-stream-interceptor branch May 4, 2022 15:22